
[rag evals][1/n] refactor base scoring fn & data schema check #664

Merged
merged 12 commits into main from rag_scoring_fn_1 on Jan 2, 2025

Conversation

@yanxi0830 (Contributor) commented on Dec 19, 2024

What does this PR do?

Refactors the base scoring function classes and adds a data schema check for the scoring and eval providers (part 1/n of the RAG evals series).

Test Plan

```
pytest -v -s -m llm_as_judge_scoring_together_inference scoring/test_scoring.py --judge-model meta-llama/Llama-3.2-3B-Instruct
pytest -v -s -m basic_scoring_together_inference scoring/test_scoring.py
pytest -v -s -m braintrust_scoring_together_inference scoring/test_scoring.py
pytest -v -s -m meta_reference_eval_together_inference eval/test_eval.py
pytest -v -s -m meta_reference_eval_together_inference_huggingface_datasetio eval/test_eval.py
```

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Ran pre-commit to handle lint / formatting issues.
  • Read the contributor guideline,
    Pull Request section?
  • Updated relevant documentation.
  • Wrote necessary unit or integration tests.

@facebook-github-bot added the CLA Signed label (managed by the Meta Open Source bot) on Dec 19, 2024
@yanxi0830 changed the title from "[rag eval][1/n] refactor base scoring fn & add more braintrust evaluators" to "[rag eval][1/n] refactor base scoring fn & data schema check" on Dec 19, 2024
@yanxi0830 marked this pull request as ready for review on December 20, 2024 00:10
@yanxi0830 changed the title from "[rag eval][1/n] refactor base scoring fn & data schema check" to "[rag evals][1/n] refactor base scoring fn & data schema check" on Dec 20, 2024
@@ -13,12 +13,51 @@

class BaseScoringFn(ABC):
Contributor:
I still don't understand why we have these base classes, because we have already declared what our impls need to do in terms of datatypes in our APIs. So the datatype for a scoring function already exists. Let's say I am implementing a new scoring function -- why do I need another base class and inherit from there? If there are utilities I need for implementing the functions, they would just be utils / free functions or, in the worst case, some mixins.

Can you explain the need for base classes please? I am very allergic to inheritance, as you and @raghotham know :)
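To make the trade-off under discussion concrete, here is a minimal, hypothetical Python sketch of the two alternatives being compared: an abstract base class that implementations must inherit from, versus shared logic kept in a free function so implementations need no mandatory base class. The method names and signatures below are illustrative assumptions, not the actual llama-stack code.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List


# Alternative 1: an abstract base class that every scoring function inherits from.
class BaseScoringFn(ABC):
    @abstractmethod
    async def score_row(self, input_row: Dict[str, Any]) -> Dict[str, Any]:
        """Score a single dataset row."""
        ...

    async def score(self, input_rows: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
        # Shared convenience loop that subclasses get for free.
        return [await self.score_row(row) for row in input_rows]


# Alternative 2: the shared loop lives in a free function; a scoring function is
# then any object exposing a `score_row` coroutine, with no required base class.
async def score_rows(scoring_fn: Any, input_rows: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    return [await scoring_fn.score_row(row) for row in input_rows]
```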

Contributor Author (yanxi0830):

This class BaseScoringFn(ABC) is mostly separated out based on feedback from Tejas, for his use case without registration (syncing with you offline).

For our llama-stack implementations, I agree we don't need this separate BaseScoringFn and could just use RegisteredBaseScoringFn as a mixin.
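As a rough illustration of the split being described, a registration layer could sit alongside the plain base class, so the base stays usable on its own for the no-registration use case. This is a hypothetical sketch: the `ScoringFnDef` stand-in, the registry attribute, and the method names are assumptions for illustration, not taken from the PR diff.

```python
from dataclasses import dataclass
from typing import Dict, List


# Hypothetical stand-in for the scoring-function definition datatype in the API.
@dataclass
class ScoringFnDef:
    identifier: str
    description: str = ""


# A registration mixin that llama-stack's own implementations could use, while
# external use cases stick with the plain, registration-free base class.
class RegisteredBaseScoringFn:
    def __init__(self) -> None:
        self.supported_fn_defs_registry: Dict[str, ScoringFnDef] = {}

    def register_scoring_fn_def(self, fn_def: ScoringFnDef) -> None:
        if fn_def.identifier in self.supported_fn_defs_registry:
            raise ValueError(f"Scoring fn {fn_def.identifier} is already registered")
        self.supported_fn_defs_registry[fn_def.identifier] = fn_def

    def get_supported_scoring_fn_defs(self) -> List[ScoringFnDef]:
        return list(self.supported_fn_defs_registry.values())
```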

Contributor:
@yanxi0830 I see, interesting. Thanks for the explanation. We can keep this as is for now.

@@ -0,0 +1,93 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
Contributor:
Can datasetio abstract schema validation instead of creating mixins to be used by scoring and evals?

Contributor:
@raghotham can you explain what you mean by this with an example? How do you imagine this validation happening specifically?
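For context on what a data schema check of this kind typically involves, here is a purely illustrative Python sketch of validating a dataset's columns against an expected schema before a scoring or eval run. The `validate_dataset_schema` helper, the schema format, and the column names are assumptions for illustration, not the PR's actual implementation.

```python
from typing import Dict, List


# Hypothetical expected schema: column name -> type tag. In practice the expected
# columns would come from the scoring / eval task definition.
EXPECTED_SCORING_SCHEMA: Dict[str, str] = {
    "input_query": "string",
    "generated_answer": "string",
    "expected_answer": "string",
}


def validate_dataset_schema(
    dataset_schema: Dict[str, str],
    expected_schemas: List[Dict[str, str]],
) -> None:
    """Raise if the dataset's schema matches none of the accepted schemas."""
    if not any(dataset_schema == expected for expected in expected_schemas):
        raise ValueError(
            f"Dataset schema {dataset_schema} does not match any expected schema "
            f"in {expected_schemas}"
        )


# Example: a scoring or eval provider could call this before running a task.
validate_dataset_schema(
    {"input_query": "string", "generated_answer": "string", "expected_answer": "string"},
    [EXPECTED_SCORING_SCHEMA],
)
```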

@ashwinb (Contributor) left a review comment:

lg

# What does this PR do?

- add more braintrust scoring functions for RAG eval
- add tests for evaluating against context

## Test Plan

```
pytest -v -s -m braintrust_scoring_together_inference scoring/test_scoring.py
```
<img width="850" alt="image"
src="https://github.com/user-attachments/assets/2f8f0693-ea13-422c-a183-f798faf86433"
/>


**Example Output**
- https://gist.github.com/yanxi0830/2acf3b8b3e8132fda2a48b1f0a49711b

<img width="827" alt="image"
src="https://github.com/user-attachments/assets/9014b957-107c-4c23-bbc0-812cbd0b16da"
/>

<img width="436" alt="image"
src="https://github.com/user-attachments/assets/21e9da17-f426-49b2-9113-855cab7b3d40"
/>

@yanxi0830 merged commit 3a269c4 into main on Jan 2, 2025
2 checks passed
@yanxi0830 deleted the rag_scoring_fn_1 branch on January 2, 2025 19:21